Skip to content

Conversation

@elad335
Copy link
Contributor

@elad335 elad335 commented Oct 31, 2025

Optimizations:

  • cellRtc functions were called non-stop by the PPU thread which use sys_memory_get_page_attribute internally.
    The writer lock in sys_memory_get_page_attribute was causing SPUs to wait unjustly.
  • Fast path without locking VM has been added to sys_rsx_context_iomap.
  • Exclude spu_thread::reservation_check address receptacle from writer_lock detection and waiting.
  • Fix spu_thread::reservation_check(hash) overload for main and stack memory.
  • Add fast path without checking allocation for spu_thread::reservation_check when the address is on the same page as GETLLAR's effective address.

Fixes #14724

@elad335 elad335 added CPU Bugfix Optimization Optimizes existing code labels Oct 31, 2025
@elad335 elad335 requested a review from kd-11 October 31, 2025 10:02
@elad335 elad335 force-pushed the gamer branch 3 times, most recently from a6a2fc0 to 9f8c31f Compare November 2, 2025 07:28
@elad335 elad335 changed the title vm/sys_memory: Remove VM locking in sys_memory_get_page_attribute SPU: SPURS oriented thread waiting Nov 2, 2025
@elad335
Copy link
Contributor Author

elad335 commented Nov 2, 2025

Added "SPURS oriented thread waiting" which is gonna replace "Preferred SPU Threads" setting and be active by default.

@elad335 elad335 force-pushed the gamer branch 2 times, most recently from cafffb4 to 8d667a9 Compare November 2, 2025 09:09
@elad335
Copy link
Contributor Author

elad335 commented Nov 2, 2025

Special settings tweaked from default:

Clocks Scale: 800%
Vblank Frequency: 480Hz
Frame Limit: Off
Preferred SPU Threads: Auto (as default)

PR:

image

Master

image

@digant73
Copy link
Contributor

digant73 commented Nov 2, 2025

performance are regressed on some games such as KZ3

PR:

kz3_pr

MASTER:

kz3_master


constexpr u32 _1m = 1u << 20;

std::unique_lock fast_lock(render->sys_rsx_mtx, std::defer_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the benefit of the double check under lock especially since we expect frame-to-frame the mappings wont actually change. Why not just lock and check once? I feel that would be faster here.

break;
}

const u64 current = get_system_time();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pointed it out before, but get_system_time is unreasonably heavy. Prefer TSC unless real-world precise values are required.

A general note - the spu_info logic (test_and_update_atomic_op_info) in general is quite heavy-handed with all the atomic ops and may eat into performance. The biggest issue I see is that there is no fast-path through this calling sequence (and the corresponding one below). Yes, spurs itself is going to be almost always running task groups but we also observe that in most games the parallel misses themselves aren't too bad with modern processors, though I agree we need something more sophisticated than the quick hack that was the preferred threads option.
This is all theory of course, we'll just have to see if it ends up worth the overhead with the big hitters like RDR, TLOU or killzone titles.


spu_info[index].release(info);

for (usz i = 0; i < spu_info.size(); i++)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can abuse vector ops for this sequence and gain implicit atomicity.
Have the spu info as an object of arrays instead of array of objects.
Then you can just load all of them at once and (ab)use vector ops on the vector to figure out how much overlap there is.

On x86 at least, vector ops are atomic as long as they are naturally aligned too so we basically get that for free.

@MsDarkLow
Copy link
Contributor

i9-13900K | RTX 3080
Ratchet & Clank Tools of Destruction - Improved a little bit. 88.3 -> 92.5
God of War 3 - Very small Regression. 62.4 -> 61.0
Sonic Unleashed - Tremendously reduced performance. From a stable 42.4 to very unstable 32.4

Ratchet & Clank

ratchet tod

God of War 3

gow3

Sonic Unleahsed

sonic u

@elad335 elad335 changed the title SPU: SPURS oriented thread waiting SPU: Update CELL Communication Performance Nov 6, 2025
@elad335
Copy link
Contributor Author

elad335 commented Nov 6, 2025

I've pushed an experimental update, please test. If it works I'll put it under a special setting.

@MsDarkLow
Copy link
Contributor

MsDarkLow commented Nov 6, 2025

i9-13900K | RTX 3080
After these updates, all four games I've tested before have improved!
Ratchet & Clank Tools of Destruction - 88.3 -> 96.0 (earlier in the pr was at 92.0)
God of War 3 - 62.0 -> 64.0 (earlier in the pr it was 61.0)
Sonic Unleashed - 42.4 -> 43.4 (earlier in the pr it was 32.4)
Metal Gear Soils 4 - 95.4 -> 96.3

Ratchet & Clank

ratchet tod - pt2

God of War 3

gow3 - pt2

Sonic Unleahsed

sonic u - pt2

Metal Gear Soild 4

MGS4 - pt2

@digant73
Copy link
Contributor

digant73 commented Nov 6, 2025

kz3 is still very bad (40 fps vs 69). GPU usage in particular is very low.

EDIT: attached also the log

RPCS3.zip

kz3_pr_1

@digant73
Copy link
Contributor

digant73 commented Nov 6, 2025

bad performance also in infamous 2 demo (50 fps vs 80). the game also crashes with the following error:

PR:

image image

MASTER:

image

std::lock_guard lock(g_camera.mutex);

*info = g_camera.info;
CellCameraInfoEx info_out;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?
Zero comments

Copy link
Contributor Author

@elad335 elad335 Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't touch VM memory under mutex for few reasons. (RSX access violations lengthens the duration of the lock for example)
We can put it in the coding guidelines.
There is no need to comment it each time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then a wrapper construct makes more sense, otherwise it will be repeated again elsewhere. Or maybe unlocked probe_for_read / probe_for_write makes more sense like usually done in real drivers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Bugfix CPU Optimization Optimizes existing code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Regression] Performance Regressions in Hot Shots Golf: Out of Bounds (#11904, #12523)

5 participants